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Abstract 

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix- 
based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type 
of similarity search has importance in both providing a building block to more complex algorithms 
and for possible use in direct biological investigations where datasets are of the order of 60 million 
objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs 
exceptionally well, for example outputting 100 nearest neighbours to any possible fragment of length 
10 after scanning on average less than one per cent of the entire dataset. 

Keywords: Similarity search; Indexing; Protein fragments; Quasi-metrics 

1 Introduction 



Indexing of biological sequences for fast similarity search has received a lot of attention in recent years 
ll26ll25i r7l [T7ll35ll28ll5T l. While many schemes were proposed for indexing DNA sequences, much less 
effort has been spent on protein sequences. This is understandable because the DNA datasets are many 
times larger and offer potentially more information. However, similarity search in protein datasets differs 
in a few important aspects from the corresponding DNA based search: the protein alphabet is larger and 
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Health, Bethesda, MD 20894, United States of America. 



1 



y 


Amino apiH alnhahf^t 


in 


Pt"q frm o n 1 1 f^n n"l"ri 


Y 

.A 


1 id^lllCllL Uu.Lu.dCL. 


yi 


Siyp of \' ii^iiJillv not Icnown PYJiPtlv hpforph^inrl 


f 
J 


yjuery luncuon on . ji^xj — 2^,-^o Ji\^i)- 


f. 

Ji 


/-in coiiiponcni oi uie cjuery lunciion. 




Ranof^ niif^r\/ \x/ith fpcnppt to f w/ith thrpcholH P* all "\7 \^ Qiipti that f{'\7 \ <^ P 

IVUll^C LILlCiy WlLll iCSUCCL LxJ J WlLll LlliCSllUlVJ. C. Ull V ^i- LiiuL / \ V / ^ C. 




/C-ilCdiCSL IICI^IIUULIIS mJciy Willi iCSpcLL LU J . A, pUlllLS llUlll WlLll LllC SlllclllcSL VdlUCS Ul J . 


S 


Similarity score on Z'": s{x^y) = Y!l!^o^ ^{^iiyi) 


S 


Similarity score matrix over L: a map 6 : L x L ^ M. 


d 


A quasi-metric distance (over L but also used in general context). 


D 


Distance matrix over L: a map £) : L x L ^ M+. 


P 


A metric symmetrization of the quasi-metric d. 


w 


Weight function lor a co-weightable quasi-metric space. 




A iibre: subset or 12 with constant weight. 




Reduced alphabet at the i-th position. 




r rojeciion — > i ^ ai ine /-in position. 




Projection function: maps each fragment into its bin. 


7 


A letter in a reduced alphabet. 




Set of all bins. 


N 


Total number of bins: = UfJo \^r\- 




Integer value of a letter of the reduced alphabet at the j-th position. 


r 


Bin ranking function: index into the bin array. 
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Lower bound to the query function over bins: F{B) = J^I'Sq Fi{Bi) 



Table 1: Summary of symbols. 



the similarities between any two individual amino acids are not all identical but are given instead by a 
score matrix such as a member of the PAM [14] or BLOSUM [22] family. 

In this paper we focus not on the whole protein sequences but on their fragments. We take as our 
datapoints the set of all peptide fragments of fixed length (also known as q-grams in the more general 
context of string matching |[52l ) taken from a protein sequence database. Our similarity measure is 
global: the similarity score is given by the sum of the scores of amino acids at the same position, that 
is, by extension of a similarity measure on the amino acid alphabet. As the fragments we investigate are 
quite short (length 4-12), we do not allow gaps. 

Our motivation is twofold. Firstly, there are reasons to believe that similarity search in peptide 
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fragments can become an important tool for function discovery. Short peptide motifs can form parts of 
larger functional domains and often have important physiological function on their own |[36ll29l . We are 
interested in large-scale searches for biologically 'interesting' short peptide motifs. In that case, a search 
algorithm becomes an important subroutine, to be followed by filtering of search hits using additional 
criteria, since raw similarity score usually does not provide sufficient grounds for assumption of sequence 
homology for short protein fragments. 

Secondly, sets of short peptide fragments provide a very convenient object of study in the context 
of designing general indexing schemes for similarity search. Those datasets are quite sizable but not 
excessively large and are equipped with a simple and computationally cheap similarity measure. We 
have already used an early implementation of one of the indexing schemes in a family described here 
(FS Index) as an illustration for general indexing concepts [44]. 

The family of indexing schemes we present here, called FSIndex ('Functional Sequence Index'), 
is based on the idea that amino acids can be grouped according to their chemical properties and that 
these groupings are, albeit imperfectly, reflected in almost all similarity measures used in practice. Pre- 
computing distances (or more generally, query functions) from amino acids to clusters allows efficient 
computation of the lower bounds for the distance between query and database fragments and pruning of 
a large proportion of the dataset that can be certified not to belong to a similarity-based query. Unlike 
with the most other proposed indexing schemes, the hierarchical traversal of the search space is per- 
formed using an implicit, combinatorially generated tree, so that only the data points and the clustering 
information need to be stored. 

2 Background 

2.1 Sets of protein fragments 

We model protein fragments of fixed length m as strings of length m generated by the standard 20 amino 
acid alphabet £ and denote the set of all such strings by We write x = xqX\ . . .Xm^\ for each x G 
Obviously, the same approach can be used for any sets of strings of fixed length generated by a finite 
alphabet. A protein fragment dataset or instance (which we will usually denote by X) is therefore a 
subset of Z*". 

To describe similarity-based queries, we define a similarity measure on as a function / : — > M 
such that f{x) = L/lo/iC-"^')' where /,• are functions £ ^ M for all / G /„, (/„, = {0, 1, . . . ,m — 1}). A range 
query based on / with range e, denoted 2™^(£), retrieves all points x € X such that f{x) < £. A kNH (k 
nearest neighbours) query, denoted Qf^^{k), retrieves k points from X that have the smallest value of /. 
We call the queries based on / the half-plane or valuation queries. 

The above setting generalises the similarity queries based on a dissimilarity measure, that is, a dis- 
tance function. Indeed, if we have a distance : x ^ M+ and a point co G Z™, retrieving all points 
X GX such that (3f(ft),x) < e is equivalent to setting up a range query where /(x) =d{(0,x). While 

in the context of this paper we will mostly use distances to generate queries, there are practically relevant 
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cases where the more general formulation is desirable. 

In the context of biological sequences, similarity scores that are large and positive for similar se- 
quences and small and negative for dissimilar ones, are most commonly used. For sequences of arbitrary 
lengths, Needleman-Wunsch [41] and Smith- Waterman [49] algorithms are used to compute global and 
local similarity scores respectively, depending on the score matrix S over the alphabet and on gap penal- 
ties. Both algorithms take the time at least quadratic in the length of sequence and can also be used to 
compare the sequences of fixed length. 

Our choice however, is an ungapped similarity measure, depending solely on the score matrix, be- 
cause gaps rarely appear in alignments of fragments of short length (the penalty for opening gaps is 
usually set to be large compared to penalties for substitutions of letters). For two fragments x,y G Z™, 
we have s{x,y) = L'-'lo' '^i^i^yi) a range query centred at ft) G with cutoff threshold t retrieves all 
points X & X such that s{(0,x) > t. The fragment similarity measure described above has a significant 
advantage in that it can be computed in time linear in m, the length of fragments. 

Position Specific Score Matrices (PSSMs) [19] provide a generalisation of score matrix based simi- 
larity scores that produce more accurate search results. Instead of using the same score matrix S at each 
position / G /,„, a different score function is used. PSSMs are constructed from sets of aligned sequences 
classified according to some property and aie used primarily to detect similarities that cannot be found 
using score matrices. An iterative procedure is very commonly used [[H: a 'seed' query sequence is 
provided and the first search is performed using a score matrix. The results of that search are then used 
to construct a PSSM and another search is performed based on it. The search can be further refined as 
necessary by using the results of one search to construct the PSSM used in the subsequent query. 

The importance of PSSMs in biological applications motivates our choice of the similarity measure 
on r™ as a function of the form f{x) = TJiLofii^i) since a PSSM cannot be expressed as a standard score 
matrix. Here the functions ft are given by the columns of the PSSM. 

2.2 Quasi-metrics 

Let II be a set. A map of two arguments, d . Q.x Q. is called a quasi-metric if it satisfies the 

properties: (i) for allx,j G d{x,y) =d{y,x) = 4^ x = y, and (ii) for aHx,y,z G d{x,z) < d{x,y) + 
d{y,z) (triangle inequality). It is called a metric if in addition it satisfies d{x,y) = d{y,x) for all x,y G H 
(symmetry). Quasi-metrics form an area of active research in general topology ll30l but the knowledge 
of the geometry associated with them is still relatively scarce. 

Quasi-metrics are relevant to our investigation because of their relation to similarity scores. It can 
be verified 11501 that most of the matrices from the BLOSUM [22] family, the most widely used family 
of score matrices over the amino acids, restricted to the standard amino acid alphabet, produce a quasi- 
metric under the transformation D{x,y) = S(x,x) — S{x,y). In particular, the matrices BLOSUM 45, 50 
60, 62 (Figure [B, 65, 80, 90 and 100 do so while BLOSUM 30, 35, 40, 55, 70 and 75 do not. As the 
triangle inequality is violated only once for BLOSUM 55, 70 and 75, we suspect that its violation may 
be due to round-off error. No member of the also widely used PAM [ 14] family of score matrices can be 
transformed into quasi-metrics in this way but the number of failures of the triangle inequality is small 
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Figure 1: BLOSUM62 similarities (left) and quasi-metric distances (right). Distances within members 
of an alphabet partition used for constructing an index for fragments of length 9 used in experiments are 
grayed. 



for at least some of them, indicating that they can be closely approximated by quasi-metrics. 

The above transformation canonically extends to fragments: the distance function given by d{x,y) = 
s{x,x) — s{x,y) is then also a quasi-metric. Its main significance is that every range query centred at (O 
with threshold t based on similarity score s coincides with the range query centred at (O with respect to 
the quasi-metric d of radius e = s{o), ft)) — t. 

Example 2.1. Consider a space of fragments of length 3 from the alphabet £ = {a,b,c,d} where the 
similarity score matrix S is given in Figure |2](a). Using the transformation D(a, t) = S(a, a) — S{cs, x) 
we obtain the distance matrix D (Figure [2] (b)). 
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Figure 2: An example of a similarity score (a) and the corresponding distance matrix (b). 
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It can be verified that D is indeed a quasi-metric on £. Consider the fragments x = abd, y = cad and 
z. = ebb. Using similarity scores, we have s{x,x) = S{a,a) +S{b,b) +S{d,d) = 16, s(x,y) = S{a,c) + 
S{b,a) +S{d,d) = 5, s{x,z) = S{a,c) +S{b,b) +S{d,b) = 10, while using distances, d{x,y) = D{a,c) + 
D{b,a) +D{d,d) = 11 and d{x,z) =D{a,c) +D{b,b) +D{d,b) = 6. We can verify that d{x,y) = s{x,x) — 
s{x,y) dLndd{x,z) = s{x,x) — s{x,z)- 

Now consider a range similarity query about x with the threshold 9, that is, the query retrieving all 
points u such that s{x, u) > 9. The point z above belongs to this query while y does not. Since s{x,x) = 16, 
we can transform this query to a query with respect to d of radius 16 — 9 = 7. Using our general notation 
we represent this query by 2™^(7) where f{u) = d{x,u). 

If a similarity score matrix is symmetric (as all of BLOSUM and PAM matrices are), the cor- 
responding quasi-metric is co-weightable OTTl . that is, it satisfies d{x,y) +w{y) = d{y,x) +w(x) for 
all x,y G where w{x) = s{x,x) is called the weight function. Therefore, we can write d{x,y) = 
p{x,y) + ^{w{x) —w{y)), where p is a metric given by p{x,y) = ^{d{x,y) +d{y,x)) (a symmetrisation 
of the quasi-metric d). 

It can be easily shown that the asymmetry in a co-weightable quasi-metric is solely due to the weight 
function not being constant; indeed, a co-weightable quasi-metric is a metric if and only if the weight 
function is constant on the whole underlying set. More generally, let Q.{z) denote the set {x £Q.: w{x) = 
z}. We call for any z € w(n) a. fibre of Q.. It has been shown ||54ll that every co-weightable quasi- 
metric can be decomposed into a pairwise disjoint union of fibres, where each fibre is a metric space. 
Furthermore, every quasi-metric range query can be decomposed into a disjoint union of metric range 
queries on fibres. 

Lemma 2.2. Let X be a set and d a co-weightable quasi-metric on X with weight function w. Let p 
be a metric symmetrising d given by p{x,y) = ^ {d{x^y) + d{y^x)) for all x,y G X. Denote by *B(x, £) 
the set {j G n : d{x,y) < e} and by T>{x,e) the set {j G H : p{x,y) < e}. For x ^ and z G M, let 
5{x,z) = ^{z — w{x)). Then, for all x G il, and £ > 

^ix,e)= □ D(x,£ + 5(x,z))|n(z). 

zew{Q.) 

Proof. Since Q. clearly decomposes into a disjoint union of fibres, we can write 

^{x,e)= □ ^{x,e)\a{z). (1) 
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Now fix z and suppose y ^Q.{z). Then w{y) =z and hence 

P{x,y) = ]^{d{x,y)+d{y,x)) 

= d{x,y) + ]^[w{y)-w{x)) 

= d{x,y) + ]^{z-w{x)) 
= d{x,y) + 5{x,z). 

Therefore, ^{x, e) \Q.{z) = 3 (x, e + 5{x,z)) \^{z) and our resuh follows by ([T])- □ 

The sets *B (&),£) and T){(0,e) defined in Lemma |2!2] correspond, respectively, to the quasi-metric 
and metric queries of radius e centered at ft). Therefore, metric access methods (see below) can be used 
to index datasets for similarity with respect to a co-weightable quasi-metric. 

2.3 Metric Trees 

The majority of published indexing schemes for similarity search apply to datasets where the dissimilar- 
ity measure is a metric (this includes the special case of access methods for vector spaces) |9l l23ll45l . 
In most cases we encounter a hierarchical tree index structure where each inner node of the tree is asso- 
ciated with a set covering a portion of the dataset and a certification function certifying if the query ball 
does not intersect the covering set, in which case the node is not visited and the whole branch is pruned. 
The data points are stored in the leaf nodes that are retrieved and sequentially scanned only if necessary. 
Certification functions are mostly based on distances from points; the triangle inequality ensures that no 
members of the dataset satisfying the query are missed. For a more general setting, we refer the reader 
to our theoretical exposition [44]. 

Metric trees are usually balanced, providing (at least in theory) scalability with respect to the number 
of data points. However, it is known that hierarchical indexing structures typically suffer from the 'Curse 
of Dimensionality'. As the dimension of the dataset grows, index performance deteriorates and is often 
worse than the performance of an optimised sequential scan Q. This effect has been linked with the 
so-called phenomenon of concentration of measure on high-dimensional structures Il43l . which is also 
found in many diverse areas of mathematics and physics Il20ll33l . 

In this paper, we use two of the well known metric access methods, the M-tree |[T2l and the mvp-tree 
to index short peptide fragment datasets and to compare their performance against our own indexing 
scheme FSIndex. 

2.3.1 M-tree 

M-tree 112] is a dynamic, paged structure that can be used to index all metric datasets. It is one of the 
best generic indexing structures for metric spaces available, with many improvements added over the 
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years lITBlfTTI . A reference C++ implementation is publicly available. Memory requirements are modest 
as most of the indexed points are kept on hard disk and retrieved as needed, at a cost of additional I/O 
accesses. 

The search tree is binary and at each node a routing object (a point from the dataset) is stored together 
with the radius of the ball covering all the dataset points under that node. The certification function cal- 
culates the distance from the query centre to the routing object and decides whether the query intersects 
the covering ball. The way the routing points are chosen and data points divided between them at con- 
struction time is determined by the user's choice of one of many available split policies. Ciaccia et al. 
|[T2]| found that the best performing split policies were based on generalised hyperplane decomposition 
where each data object is assigned to the routing object closest to it. However, such policies may be very 
computationally expensive (quadratic in the number of data points). 

2.3.2 mvp-tree 

The vp-tree (vantage point tree) 1 57 1 and the mvp-tree (multiple vantage point tree) f6^ access methods 
partition the datasets using the distances from vantage points: points with a distance from a vantage point 
smaller than the median distance go to one branch, those with larger to the other. This procedure ensures 
a balanced tree. The mvp-tree is a modification of the vp-tree that uses multiple vantage points at each 
node of the search tree. For example, binary mvp-tree uses two instead of three vantage points needed 
by vp-tree to divide a covering region into four regions, resulting in fewer distance computations. 

2.4 Suffix arrays 

Trie, suffix tree and suffix array data structures form the basis of many of the established string search 
methods and provide the foundation for some features of the FSIndex access method described in Section 

m 

A trie |fT6l is an ordered tree structure for storing strings having one node for every common prefix of 
two strings. The strings are stored in extra leaf nodes. A PATRICIA tree (Practical Algorithm to Retrieve 
Information Coded in Alphanumeric [38]) is a compact representation of a trie where all nodes with one 
child are merged with their parent. Tries and PATRICIA trees can be used for exact matching of strings 
and take linear time in the length of the query string. 

Now consider a single (long) string t of length m. The suffix tree ll55l for t is the PATRICIA tree of 
the suffixes of t and can be constructed in 0{m) time |[55ll37ll53l . Suffix trees, in their original form as 
well as generalised to suffixes of more than one string, can be used to solve a great variety of problems 
involving matching substrings of long strings 1,21.1 . 

Suffix trees in general occupy 0{m) space, where the constant may be large depending on the appli- 
cation Il2ni32l . The suffix array data structure, first proposed by Manber and Myers [34], is a compact 
representation of the suffix tree for t consisting of the array pos, of integers in the range /„, specifying 
the lexicographic ordering of suffixes of t, and the array Icp, where lcp[i\ contains the longest common 
prefix of the substrings starting at positions pos[i— 1] and pos\i\. Efficient 0{m) construction algorithms 
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exist and using binary search on array pos and the Icp values, it is possible to search for occurrence of a 
string pint in 0{n + logm) time, where n is the length of p ||2T]1 . 

PATRICIA trees (and hence suffix trees and arrays), being compact representations of a set of strings, 
can also be used to accelerate non-exact searches [18], by partially evaluating the similarity function at 
each inner node at pruning those branches where it exceeds the given range. 

3 FSIndex 

FSIndex is an access method for short peptide fragment workloads mainly based on two procedures: 
amino acid alphabet reduction and combinatorial generation. 

For very short fragments (lengths 2-4), the number of all possible fragment instances is very small 
(for length 3, 20^ = 8000) and almost every fragment instance generated exists in the dataset. Hence, it 
is possible to enumerate all neighbours of a given point in a very efficient and straightforward manner 
using digital trees or even hashing. For larger lengths, the number of fragments in a dataset is generally 
much smaller than the number of all possible fragments and generation of neighbours is not feasible. If 
it were to be attempted, most of the computation would be spent generating fragments that do not exist 
in the dataset. Hence the idea of mapping peptide fragment datasets to smaller, densely and, as much 
as possible, uniformly packed spaces where the neighbours of a query point can be efficiently generated 
using a combinatorial algorithm. 

Partitions of the amino acid alphabet provide the means to achieve the above. Amino acids can 
be classified by chemical structure and function into groups such as hydrophobic, polar, acidic, basic 
and aromatic. Such classification appears in every undergraduate text in biochemistry and has been 
previously used in sequence pattern matching ||48]| . In general, substitutions between the members of 
the same group are more likely to be observed in closely related proteins than substitutions between 
amino acids of markedly different properties. The widely used similarity score matrices such as PAM 
||T4] or BLOSUM [22J are derived from target frequencies of substitutions and therefore capture these 
relationships more precisely. 

The required mapping is constructed as follows. Given a set of fragments of fixed fragment length 
Z™, an alphabet partition 71,- : £ ^ is chosen for each position / G !,„, where |F,| < |r|, canonically 
inducing the mapping tt : — ^ where ^ = Fq x Fi x . . . x F^-i. We call the members of ^ bins 
or blocks and denote their number by A'^. An important consequence of such mapping is that the lower 
bounds of the value of the query function on each bin are easy to compute. 

Example 3.1. Consider the alphabet and similarity measures from Example 12. 11 A logical partition of E, 
based on the distance matrix D, is into sets a = {a,c} and j3 = {b,d}. Hence, for fragments of length 
m = 3 with equal partitions at each position, we have A'^ = 2^ = 8 bins. 

It should be noted that the partitions tt, can, but are not required to, be equal for each /. Different 
partitions at the different positions need to be allowed in order to have a finer control on the number 
of bins. For example, consider fragments of length 12. If we take 5 partitions at each position, the 
total number of bins would be 5^^ 244 million while for 6 partitions we have 6"' ^ 2176 million. If 
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the size of our dataset lies somewhere in between these numbers, using 6 partitions would result in too 
many empty bins (and may be infeasible due to space requirements) while we can further improve on 5 
partitions by having 6 partitions at only some positions. 

3.1 Data structure 

The FSIndex data structure consists of three arrays: frag, bin and lap. The array frag contains pointers 
to each fragment in the dataset and is sorted by bin. The array bin, of size N is indexed by the rank of 
each bin and contains the offset of the start of each bin in frag. The bin ranking function r : =^ — > /yv is 
defined as follows. For each / G /,„ let r, : F,- be a ranking function of F,- and define i^,- : F,- ^ N by 

m-l 

m=ri{y)Y\\Tj\. (2) 
In the case i = m — \ the empty product above is taken to be equal to 1. Then, for any B = BqB\ . . . B^-i € 

m, 

m— 1 

r(B)=£5(B,). (3) 

;=o 

In addition, each bin is sorted in lexicographic order and the value of Icp^ provides the length of the 
longest common prefix between /rag[/] and frag[i — 1]. The value of lcp[Q\ is set to 0. Figure |3] depicts 
an example of the full structure of an FSIndex. 

The frag and Icp arrays are very similar to suffix arrays but the order of offsets in frag is different 
because frag is first sorted by bin and then each bin is sorted in lexicographic order. Sorting frag 
within each bin and constructing and storing the Icp array is not strictly necessary and incurs a space and 
construction time penalty. The benefit is improved search performance for large bins, compensating for 
unbounded bin sizes. In effect, each bin is subindexed using a compact version of a PATRICIA tree. 

3.2 Index construction 

The construction algorithm (Algorithm l3.1l) is closely related to counting sort ||46l. It makes three passes 
over data fragments: to count the number of fragments in each bin, to insert the fragments into the frag 
array and to compute the Icp array. 

The fragment dataset is in practice always obtained from a full sequence dataset by iterating over all 
subfragments of length m from each sequence and it is often necessary to verify each fragment and reject 
those that contain non-standard letters such as 'X', 'B' or 'Z' that do not represent actual amino acids 
and violate the triangle inequality for the score matrices. Therefore, the true number of data points is not 
known exactly before the first pass through the dataset. 

The space requirement of FSIndex is @{n+N). The exact time complexity of the construction 
algorithm depends on the sorting algorithm used for sorting the frag array. The implementation shown 
in Algorithm l3. 1 l uses quicksort [24] and hence the total construction time is 0{n+N+n\ogn) on average 
and 0{n + N + n^) in the worst case. 
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Figure 3: Structure of an FS Index of a dataset of fragments of length 4 from the alphabet £ = 
{A,B,C,D,E,F}. The same alphabet reduction is used at each position, mapping {A,B} to 0, {C,D} to 
1 and {E,F} to 2. 



3.3 Search 

Search using FSIndex is based on traversal of implicit trees whose nodes are associated with reduced 
fragments (bins). 

Let B = BqB I . . .Bm-i £ For each A: G/m and r^,, denote by B{k, 7) the sequence Bq . . .Bic-iyBk+i . . .Bm-i- 
For every / G /„, denote by Tbj the tree having the root B connected to the subtrees 7b(^^)j.+i for 
^ = /, / + 1 , . . . , m — 1 and 7 € Fj. \ {B^} and by Tb the tree o- 

The trees Tbj are connected and unbalanced and can be shown to have depth m — i while the root 
has the degree 'Y!k=l iT/tl — 1. The topology of Tb depends only on the sizes of the alphabets F, and not 
on any particular 5 e If |Fo| = |Fi | = . . . = |Fm-i \= K,Tb is isomorphic to the multinomial tree of 
order (m,K). If K = 2, such tree is called the binomial tree of order m. An example is shown in Figure 
111 

It is easily established that for any B ^ SB there exists a bijection between the nodes of Tb and the set 
SS, that is, traversal of Tb enumerates SS. It can also be easily seen that the nodes at depth k from the root 
have Hamming distance k from the root. 

3.3.1 Range queries 

Retrieval of a half-plane range query 2™^(e) using the implicit tree structure is conceptually straightfor- 
ward. Define F : ^ ^ M by F(B) = l,'"=o^ Fi{Bi) where Fiiy) = min«e^/(a) so that F gives the lower 
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Algorithm 3.1: CONSTRUCTFSlNDEX(a) 



X — fragment dataset (size n) 

tmp — auxiliary array used to keep track of bin sizes 

comment: First pass to get size of each bin 

for y^OtoA^+1 

do tmp[j] ^ 
for each s eX 

\tmp[k + 2] ^ tmp[k] + 1 

comment: Second pass to insert fragments into bins 

for 7^2toA^+l 

do tmp[j] ^ tmp[j] + tmp[j - 1] 
for each s eX 

(k^r{n{s)) 
do < frag[tmp[k+ I]] ^ s 

[tmp[k+ 1] ^ tmp[k+ 1] + 1 

comment: Sort each bin 
for y ^OtoA^- 1 



\QuiCKSoRT(/rag[/mp[;] : tmp[j + l]]) 

bins[N] =tmp[N] 

comment: Compute the longest common prefixes 

lcp[0]^0 

for / ^ 1 to « — 1 

'x^ frag[i-l] 
y ^ frag \i] 

do < ^ ry,^ 

while Xj = jj 

do;^; + l 
Jcp[i\^i 
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aaa 




Figure 4: An example of Tb where B = aaa € Fq x Fi x F2, Fq = {a,b}, Fi = {a,b,c,d}, F2 = {a,b,c}. 



bound for / over each bin. Let Z,- = argmin^£p ,/^(7) and let Z = ZqZi . . .Z^-i G We traverse the 
tree Tz from the root, calculating at each node B the value of F{B) and pruning the subtree rooted at B if 
F{B) > e. For every visited node that is not pruned, we sequentially scan the associated bin and collect 
all the fragments x € such that f{x) < e. 

By the choice of Z, we have F{Z) < F{B) for all B ^ Furthermore, since each branch in Tz 
involves a change of single letter at a position, it is easy see that for each child C of B, F{B) < F{C). 
This implies that if F{B) > e, pruning the subtree rooted at B will not lose any hits. Therefore, FSIndex 
provides what we call a consistent indexing scheme ll44l . 

Our implementation of range search first processes (sequentially scans) the root bin Z and then per- 
forms a recursive, depth-first traversal of the implicit tree Tz using the function CheckNode (Algorithm 
13.21 ) with the initial arguments u = r(Z). D = and / = 0. The values of ^^-(7), min{Fyt(7) | 7eFj-\{Zi:}} 
and ^kiv) — ^ki^k) are precomputed for all k and all 7 before search making each evaluation of Chec- 
kNode very fast. All computations involve the ranks of reduced fragments rather than the fragments 
themselves because the bin array is indexed by rank. The function ProcessBin (Algorithm 13.31) scans 
each bin associated with a node that is not pruned. It uses the Icp array to reduce the number of compu- 
tations of the function /. 

Example 3.2. Continuing with the fragment space and partitions described in Examples 12.11 and 13.11 
suppose we need to retrieve the query 2™^(7) where /(v) = d{x,v) for x = abd. In this case, /{{y) = 
D{xi, 7) for / = 0, 1 , 2. We first compute F,-, as shown: 
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For example, Fi (a) =mm{D{b,a),D{b,c)} = 8. 
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Algorithm 3.2: CheckNode(m,£),0 



for y <— m — 1 downto / 

f if D + min {Fj{y) | 7 e T,- \ {Zj}] < e 
fforeacli 7Gr^ \{Z;} 



do < 



then < 



do < 



\iE < e 

then < ProcessBin(v) 

^ CheckNode (v,E,j + \) 



Algorithm 3.3: ProcessBin(m) 

HL — list of hits 

CD — array of length m + 1 



n <— bin[u + 1] — bin[u\ 
ifn = 
then return 

CD[0] ^ 



for i 



do < 



to n - 1 

s^frag[u + i\ 

for j ^ lcp[u + i] to lcp[u + / + 1] — 1 

AoCD[j+l]^CD[j]+fj{sj) 
\iCD[lcp[u + i+l\]<e 

for j <— lcp[u + j + 1] to m — 1 
then doCD[j+\]^CD[j]+fj{sj) 
MfC£>[m] <£ 

then InsertHit {HL,s,CD[m]) 



Clearly, Z = ajip (the bin containing x) and F{Z) = so Z must be scanned. The children of Z 
are the bins j3j3j3, aap and apa having the values of F 7, 8 and 8 respectively. Hence, we prune the 
subtrees originating at aaj3 and ajia and scan the bin j3j3j3. Both children of /3j3j3 have F greater than 
7 so the algorithm stops here. 
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3,3.2 WN queries 

Our kNN search algorithms use branch-and-bound |[T2ll23i traversal involving initially setting the radius 
£ to a very large number, inserting first k data points encountered into the list of hits and then setting e to 
be the largest value of / for all hits. From then on, if a point having the value of / less than e is found, it 
replaces the hit with a previously largest value of / and the new value of e is computed. Eventually, the 
value of £ is reduced to the exact value necessary to retrieve k nearest neighbours. 

We implement the branch-and-bound procedure using a priority queue (heap) returning the largest 
value of / for the list of hits. Most of the code for range search can be reused: it is only necessary to 
change the InsertHit routine used by the ProcessBin function. 

3.4 Arbitrary fragment lengths 

In most practical situations, fragment datasets are datasets of suffixes of full sequences. The FSIndex 
structure as is can be used without modifications for answering queries longer than m, the original length: 
each fragment of length m is a prefix of a suffix of length m' where m' > m. To search with a query of 
length m', traverse the search tree using the first m positions and sequentially scan all the bins retrieved, 
using all m' positions to calculate the value of /. If m' > m, the few fragments of length m at the end of 
each full sequence can be identified and ignored at the sequential scan step. 

Similarly, FSIndex can be used to answer queries of length m" where m" < m. At the construction 
step, insert all suffixes, including those of length less than m into the index by mapping each fragment x 
such that \x\ = m" < m, into the bin ni{xi)7l2{x2) ■ ■ ■ 7im"{^m")7m"+i ■■■Ym, where Y,„"+i,. . . ,ymwe, chosen 

so that i§m"+l(7m"+l) = ^m"+2{ym"+l) = ■■■ = <§m(7«i) = 0. 

To answer a query of length m", traverse the search tree up to the depth m" and sequentially scan 
all the bins attached to subtrees rooted at the accepted nodes using first m" positions to evaluate /. The 
ranking function given by the Equations [2] and [3] ensures that the bins that are the children of a given 
node are adjacent in the frag array. 

4 Performance of FSIndex 

This section describes the experiments on actual fragment datasets carried out to evaluate the perfor- 
mance of FSIndex. Three main classes of tests were conducted investigating general performance, scal- 
ability and effects of similarity measures. 

Each experiment consisted of 5000 searches using randomly generated queries. The queries were 
produced by first generating several random sequences and then taking a sample of non-overlapping 
fragments. Each full sequence was constructed using Dirichlet mixtures ETl [HI, with the code ob- 
tained from |htt p : //www . cse . ucsc. eciu/research/compbio/dirichlets/[ the length of the se- 
quence was modeled by a discretised log-normal distribution 1421 and the amino acids of a sequence 
are generated by an independent, identically distributed process with the the probabilities coming from 
a mixture of Dirichlet densities. According to the authors of [i47il . using Dirichlet mixtures gives a better 
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model of protein sequences than using a single distribution based on background frequencies of amino 
acids. 

The main measures of performance are the number of bins and dataset fragments scanned in order to 
retrieve k nearest neighbours. The principal reason for expressing the results in terms of the number of 
nearest neighbours retrieved rather than the radius is that it allows comparison across different indexing 
schemes, datasets and similarity measures. Furthermore, most existing protein datasets are strongly non- 
homogeneous and the number of points scanned in order to retrieve a range query for a fixed radius 
varies greatly compared to the number of points scanned in order to retrieve a fixed number of nearest 
neighbours. However, it should be noted that most experiments involve range search algorithms, which 
are generally more efficient. 

The final statistic is the percentage of residues (letters) scanned out of the total number of residues 
in all scanned fragments. It measures the effect of sub-indexing each bin using the suffix-array-like 
structure involving partially scanning each fragment with a help of the Icp array. 

FSIndex creation and search algorithms were implemented in the C programming language. The 
experiments described in 14 .2 [ 1431 14.41 and 14.51 were were performed on a Sun Fire[tm] 280R server (733 
MHz CPU, 2 GB of memory) under Solaris. The experiments described in 14. 6 1 and 14.7 . 21 were performed 
on an Intel(R) XEON [tm] 2.2 GHz CPU under Linux. Performance of M-tiee (14.7.11) was tested on Sun 
Fire[tm] 6800 servers (1.2 GHz CPU) under Solaris. All testing programs were compiled with the native 
compiler (Sun compiler on Solaris, gcc on Linux) with maximum optimisation flags. 

4.1 Datasets and indexes 

We used two publicly available protein sequence datasets as sources for our overlapping fragment datasets: 
NCBI nr (non-redundant - version from June 2004 containing 1,866,121 sequences consisting of 619,474,291 
amino acids) |[56l and SwissProt (Release 43.2 of April 2004, containing 144,731 sequences consisting 
of 53,363,726 amino acid residues) ||5l. 

The experiments investigating general performance and effect of different similarity measures used 
fragment datasets derived from SwissProt. The scalability experiments used, in addition to SwissProt, 
the datasets nrOlSK, nr036K, nr072K, and nr288K, obtained by randomly sampling 18, 36, 72 and 288 
thousands of sequences respectively from the nr dataset (SwissProt fills the gap because it contains about 
150,000 sequences). Table |2] describes the datasets and alphabet partitions used for the experiments. 

In most cases, the partitions of the amino acid alphabet were chosen as a result of practical consid- 
erations based on the BLOSUM62 quasi-metric (Figured]) distances. The main requirement was that the 
distances from amino acids to clusters be as large as possible, ensuring a high lower bound on distances 
from any possible query point to any partition but its own. While this can be made into a combinato- 
rial optimization problem (for example, by maximizing the expected distance from an amino acid to a 
cluster based on the relative frequencies of amino acids), the number of possible partitions of a set of 20 
elements is extremely large and the exact solution of the problem is infeasible. Instead, the partition for 
the index SPEQ06 was produced by clustering amino acids that were close according to the BLOSUM62 
quasi-metric and the partitions for the indexes SPECj09 and SPEQ12 were obtained by coarsening it by 
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Index 


Dataset 


Partitions 


Fragments 


Bins 


bpace usage 
(Mb) 


SPEQ06 


SwissProt 


T,SA,N,ILV,M,KR,DE,q,WF,Y,H,G,P,C 


53486349 


7529536 


283 


SPEQ09 


SwissProt 


TSAN , ILVM , KR , DEQ , WFYH , GPC 


53478888 


10077696 


293 


SPEQ12 


SwissProt 


TSAN , ILVM , KRDEQ , WFYHGPC 


53472161 


16777216 


318 


nr01809 


nrOlBK 


TSAN , ILVM , KR , DEQ , WFYH , GPC 


6005750 


10077696 


67 


nr03609 


nr036K 


TSAN , ILVM , KR , DEQ , WFYH , GPC 


11911191 


10077696 


95 


iirO7209 


nr072K 


TSAN , ILVM , KR , DEQ , WFYH , GPC 


23878523 


10077696 


152 


nr28809 


iir288K 


TSAN , ILVM , KR , DEQ , WFYH , GPC 


95593618 


10077696 


494 


SPNA09 


SwissProt 


KR,q,E,D,N,T,SA,G,H,W,Y,F,P,C,ILV,M 

KR,q,ED,N,T,SA,G,HW,YF,P,C,ILV,M 

KR,QED,N,TSA,G,HW,YF,P,C,ILVM 

KR , QEDN , TSA , G , HWYF , PC , ILVM 

KR , QEDN , TS A , G , HWYFPC , ILVM 

KR , QEDN , TSAG , HWYFPC , ILVM 

KRQEDN , TSAG , HWYFPC , ILVM 

KRQEDN , TSAG , HWYFPCILVM 

KRQEDNTSAG , HWYFPCILVM 


53478888 


10483200 


294 


SPNB09 


SwissProt 


KR , QEDN , TSA , G , HWYF , PC , ILVM 
KR , QEDN , TSA , G , HWYF , PC , ILVM 
KR , QEDN , TSA , G , HWYF , PC , ILVM 
KR , QEDN , TSA , G , HWYF , PC , ILVM 
KR , QEDN , TSA , G , HWYFPC , ILVM 
KR , QEDN , TSAG , HWYFPC , ILVM 
KR , QEDN , TSAG , HWYFPC , ILVM 
KRQEDN , TSAG , HWYFPC , ILVM 
KRQEDN , TSAG , HWYFPCILVM 
KRQEDNTSAG , HWYFPCILVM 


53476582 


8643600 


287 



Table 2: Instances of FSlndex used in experimental evaluations. The last two digits of the index name 
denote the length of reduced fragments. The indexes SPNA09 and SPNB09 use non-equal partitions at 
different positions (all shown) while the remainder were constructed using one partition for all positions 
(only one shown). Space usage denotes the space used by index only, excluding the dataset itself. 

merging 'close' clusters. 

The additional requirement we considered was to have as few empty bins as possible. We attempted 
to keep the sizes of clusters in terms of their relative frequencies close to balanced and the number of 
partitions per residue was decreased with fragment length. For that reason, some amino acids having 
very small overall frequencies, such as tryptophan ('W') and cysteine ('C'), had often to be clustered 
with amino acids that were very distant from them. 

The alphabet partitions from Table|2]agree with the 'biochemical intuition', that is, with the chemical 
properties of amino acids. For example, the clusters outlined in Figure [T] used for fragments of length 
9 approximately correspond to polar uncharged, hydrophobic, basic, acidic, aromatic and 'other' amino 
acids. The partition used for the fragments of length 12 is obtained by merging together acidic and 
basic as well as aromatic and 'other' clusters. An interesting fact is that in this case each of the the four 
clusters has a relative frequency very close to |. This however, does not guarantee balanced bin sizes: in 
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all cases the distributions of bin sizes were strongly skewed in favour of small sizes (Figure [5] shows one 
example) with many empty but also a few very large bins. Such distributions appear to follow the DGX 
distribution, a generalisation of Zipf-Mandelbrot law proposed by Bi, Faloutsos and Korn Bl . 



(0 

o • 
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BIN SIZE 

Figure 5: Distribution of SPEQ09 bin sizes (2,342,940 empty bins out of 10,077,696). 

As can be seen from Table |2l the number of bins in most cases is close to 10 million, giving on 
average, in the case of the SwissProt dataset, about 5.3 points per bin. The indexes SPEQ06 and SPEQ12 
have somewhat fewer and more bins, respectively, because a closer number could not be reached with 
the same partitions at the same position. It was practically infeasible to test the basic case where each 
amino acid is given its own partition because of the large memory requirement of FSIndex in this case. 
Even for length 6, this would result in 20^ = 64 million bins, many of which would be empty. The large 
number of bins also means that storing the index on disk would result in too many page accesses and 
excessive runtimes. 

The indexes SPNA09 and SPNB09 used position specific partitions in order to test the performance 
of unequal partitions. The choice was mainly based on biochemical categorisations, merging 'close' 
partitions if needed. The index SPNA09 used larger numbers of partitions at the first few positions while 
the numbers of partitions for SPNB09 was more balanced. 

It should be noted that it was indeed necessary to use alphabet clusters approximately corresponding 
to their biological characterisations (and hence to most similarity measures used in practice). Our pre- 
liminary investigations using random partitions showed that a search in that case is essentially reduced 
to sequential scan - too expensive to be used for comparison. 

4.2 General performance 

Figure [6]presents selected statistics of search experiments for fragment lengths 6,9 and 12 respectively, 
consisting in each case of range queries retrieving 1, 10, 50, 100, 500 and 1000 nearest neighbours with 
respect to the BLOSUM62 -based quasi-metric. For each length, kNN searches were performed prior to 
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Figure 6: Performance of FS Index for fragments of length 6 (a-c), 9 (d-f) and 12 (g-i). 
First column (a,d,g): average number of bins scanned. Second column (b,e,h): average 
number of fragments scanned. Third column (c,f,i): percentage of residues scanned (out 
of total number of residues in fragments scaimed). 
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Figure 7: Median radius of a query retrieving k nearest neighbours from the SwissProt fragment dataset 
of length 6 (a), 9 (b) and 12 (c) with respect to the BLOSUM62 quasi-metric. 



range searches using the index that was expected to be the fastest in order to determine the search ranges 
for each random query fragment. The median query radii required to retrieve k nearest neighbours are 
shown in Figure [T] 

4.3 Scalability 

Figure H] shows the results of a set of experiments involving instances of FSIndex based on datasets 
of fragments of length 9 of different sizes (nrOlSK, nr036K, nr072K, SwissProt and nr288K). All 
indexes used the same alphabet partition (Table [2ll and all queries were based on the BLOSUM62-based 
quasi-metric. 

4.4 Access overhead and running time 

Figures |9]and[T0]give a summary of the results of Subsections 14. 2l and 14.3 I for all combinations of indexes 
and fragment lengths available. Figure |9] shows the average access overhead, that is, the average ratio 
between the number of fragments scanned and the number of true neighbours retrieved. Figure [TOl shows 
the average search times. Range search algorithm and the BLOSUM62-based quasi-metric were used in 
all cases. 

4.5 Range versus fcNN searches 

Figure [TT] shows the relative overhead of the branch-and-bound ^NN search algorithm as compared to 
the range search algorithm. It gives the ratio between the number of bins retrieved for /cNN and range 
searches for the best-performing indexes for each of fragment lengths 6, 9 and 12. 
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Figure 8: Performance of FSIndex for fragment datasets of length 9 of different sizes: 
(a) Scalability. Each Une depicts a different number of nearest neighbours; (b) Mean 
number of bins scaimed; (c) Percentage of residues scaimed (out of total number of 
residues in fragments scanned). 
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Figure 9: Average access overhead of searches using FSIndex. 
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Figure 10: Average running time of searches using FSIndex. 
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Figure 1 1 : Mean ratio between the number of bins retrieved for kNN and range searches using best- 
performing indexes for fragment lengths 6, 9 and 12. 
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4.6 Dependence on similarity measures 

To investigate the difference in performance for different BLOSUM matrices, we ran range queries re- 
trieving 100 nearest neighbours of query fragments of length 9 using the best-performing index for length 
9 (SPEQ09). We also performed searches using the PSSMs constructed for each query fragment from the 
results of a BLOSUM 62-based 100 NN search in order to gain some impression about usability of FSIn- 
dex for queries involving PSSMs that are likely to arise during biological applications. Since our PSSM 
construction routines were written in the Python programming language, we used a Python program for 
PSSM searches, with the FSIndex C routines embedded. Table [3] presents a summary of the results. 



Score Matrix 


Bins (%) 


Fragments (%) 


Residues (%) 


kNN Ratio 


Average time (ms) 


BLOSUM45 


0.13 


0.15 


56.9 


1.51 


20 


BLOSUM50 


0.13 


0.14 


57.1 


1.50 


19 


BLOSUM62 


0.13 


0.15 


57.0 


1.50 


20 


BLOSUM80 


0.14 


0.16 


57.2 


1.48 


21 


BLOSUM90 


0.15 


0.19 


57.2 


1.48 


25 


PSSM 


0.11 


0.12 


52.0 


9.95 


16 



Table 3: Performance of the FSIndex SPEQ09 with different similarity measures. The values shown are 
averages based on 100 NN queries of length 9. The columns denote the similarity measure (matrix), per- 
centages of bins, fragments and residues (as before the percentage is out of the total number of residues in 
scanned fragments) scanned, the ratio between the number of bins retrieved for kNN and range searches 
and the average search time. 

4.7 Comparisons with other access methods 

The final set of experiments compares FSIndex with M-tree, mvp-tree and a suffix array based access 
method. In general, these access methods take significantly more space and time compared to FSIndex 
and it was therefore necessary to restrict the comparisons to small datasets and queries retrieving fewer 
neighbours. 

4.7.1 M-tree 

Recall that M-tree is a paged metric access method that stores the majority of the structure in secondary 
memory, usually on hard disk. This is in contrast with the implementations of FSIndex, mvp-tree and 
suffix arrays used here, which store the whole index structure in primary memory. Hence, although M- 
tree occupies large amounts of space, most of the costs are associated with the secondary memory, which 
is much less expensive in terms of price. On the other hand, I/O costs, not considered here, can be quite 
large. 
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M-tree was tested as a part of the FMTree indexing scheme that allows use of metric indexing 
schemes for retrieval of quasi-metric queries. FMTree was constructed by splitting the dataset into fibres 
and indexing each fibre separately using an instance of M-tree that was created using the BulkLoading 
algorithm of Ciaccia and Patella [,10.| . The implementation of M-tree was obtained from its authors' web 
page http : //www-db . deis .unibo . it/Mtree/ index.html 
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Figure 12: Performance of FMTree based on M-tree on a dataset of fragments of length 10. Median and 
worst case results for 100 random queries are shown. Error bars show the interquartile range. 

The dataset in this experiment was the set of 1,753,832 unique fragments of length 10 obtained from a 
5000 protein sequence random sample taken from SwissProt (Release 41.21). An FMTree was generated 
for BLOSUM62-based quasi-metric at a cost of 34,142,940 distance computations. Figure [T2] shows the 
results based on 100 random queries. 

4.7.2 Suffix arrays and mvp-tree 

Table |4] presents the results of comparisons between FSIndex (^NN and range search algorithm), suffix 
array and mvp-tree over the datasets of fragments of length 6 and 9 from nrOlSK. The similarity mea- 
sure used was the associated metric to the BLOSUM62-based quasi-metric given for each a;, j G by 
p{x,y) = max{d{x,y) ,d{y,x)}. We used a metric rather than a quasi-metric because construction of an 
FMtree based on mvp-tree would require significant modifications to the original code and we observed 
that the performance of FSIndex does not much differ if a quasi-metric is replaced by its associated met- 
ric. If the mvp-tree showed good performance on metric workloads, the next step would be to split the 
datasets into fibres to create an FMTree for quasi-metric searches. 
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Instances of suffix array were constructed using tlie code from|http : //www . cs . dartmouth . edu/'^doug/sarray/ 
Tlie searcli algoritlim was identical to tlie Algorithm 13.31 where the input is a single bin containing all 
fragments in the dataset. In order to construct an instance of mvp-tree, duplicate fragments in the datasets 
were collected together and the sets of unique fragments provided to the mvp-tree construction algorithm. 
We modified the mvp-tree implementation developed by the original authors of mvp-tree |6] provided to 
us by Marco Patella for use with protein fragments. The maximum size of a leaf node was set to be 5. 
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15.0 


0.1 


9.9 


0.1 


20395.1 


62.8 


8377.6 


4.8 
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10 


12.1 


0.2 


7.1 


0.1 


3809.0 


84.1 


6614.3 


38.8 


9 


1 


1869.7 


1.6 


1303.6 


1.3 


73374.7 


135.9 


1021975.1 


689.0 


9 


10 


902.7 


4.5 


615.5 


3.3 


14975.5 


188.5 


214246.4 


1491.4 



Table 4: Comparison of performance of FSIndex, suffix array and mvp-tree. The table shows the values 
of the average effective access overhead and the average running time for each access method. The 
overhead measures the number of characters (residues) accessed in order to retrieve a given number of 
nearest neighbours, normalised by the fragment length and the number of retrieved neighbours. The 
statistics are in terms of characters rather than data points because suffix array search algorithm passes 
by each point but only computes the distances if necessary. 



5 Discussion 

While the experiments presented in the previous section covered few datasets and a small proportion of 
possible parameters for FSIndex creation, it can still be observed that FSIndex performed well according 
to many theoretical benchmarks as well as the actual search times. Not only did it perform much better 
than the other indexing schemes tested but it has proven itself to be very usable in practice: it does not 
take too much space (5 bytes per residue in the original sequence dataset plus a fixed overhead of the bin 
array), considerably accelerates common similarity queries and the same index can be used for multiple 
similarity measures without significant loss of performance. In the remainder of the current section we 
will examine some salient features of the experimental results. 

5.1 Performance as power law 

The most striking feature of Figures [6lfT0l is the apparent power-law dependence of the number of bins 
scanned and number of fragments scanned on the number of actual neighbours retrieved, manifesting as 
straight lines on the corresponding graphs on the log-log scale. For each index, the slopes of the graphs 
of running time, bins scanned and fragments scanned are very close, implying that the same power law 
governs the dependence of all three variables on the number of neighbours retrieved. The exponents are 
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0.81 for length 6, between 0.57 and 0.63 for length 9, and about 0.45 for length 12. While a rigorous 
theory, especially in the context of quasi-metrics, is still missing, we believe that the exponent is related 
to the 'dimensionality' of the index. For more detailed explanation we refer the reader to our theoretical 
exposition on indexing schemes l.44il . 

5.2 Effect of subindexing of bins 

PATRICIA-like subindexing of bins was introduced in order to accelerate scanning of bins containing 
many duplicate or highly similar fragments. Figures [6] (c,f,i) and [8] (c) show that there are two main 
factors influencing the proportion of residues scanned out of the total number of residues in the fragments 
belonging to the bins needed to be scanned: the (average) size of bins and the number of alphabet 
partitions at starting positions. Instances of FSIndex having many partitions at first few positions perform 
well (SPEQ06, SPNA09), those that have few partitions with many letters per partition, less so. 

Clearly, if a bin has a single letter partition at its first position, the distance at that position need be 
only retrieved once, at the start of the scan, independently of the number of fragments the bin contains. 
The effects for the second and subsequent positions are less prominent, if only for the reason that using 
many partitions would result in many bins being empty. The actual composition of the dataset is also 
important, as Figure [8] (e) attests: although same partitions are used and nr0288K is almost twice as 
large, SPEQ09 scans fewer characters. The possible reason lies in the nature of SwissProt, which, as a 
human curated database, is biased towards the well-researched sequences which are more related among 
themselves while not necessarily being representative of the set of all known proteins. On the other hand, 
nr0288K is a random sample from the nr database which is exactly the non-redundant set of all known 
proteins. 

In our evaluations, the average proportion of residues scanned per bin varies from 30% (SPEQ06, 
length 6) to over 85% (nr018K, length 9). The percentage of characters scanned grows slowly with 
increase of the number of neighbours retrieved, most probably because the number of bins accessed also 
grows, requiring that at least one full sequence is scanned. 

To summarise, subindexing of bins does produce some savings, the exact amount depending on 
the dataset and alphabet partitioning. However, and this is further attested by poor performance of the 
suffix array compared to FSIndex (Table |4|, the good performance of FSIndex is mostly due to alphabet 
partitioning. 

5.3 Effect of similarity measures 

Table |3] indicates very little difference in performance of the same instance of FSIndex with respect to 
different similarity measures. This should not be a surprise because the BLOSUM matrices are indeed 
very similar, modeling the same phenomenon in slightly different ways but generally retaining the same 
groupings of amino acids. The PSSM-based searches also performed well, mainly because the PSSMs 
are usually constructed out of sets of sequences that are strongly conserved at least in one or two posi- 
tions, and hence, in those positions, the 'distances' to all other clusters are so large that many branches 
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of the implicit search tree can be pruned. 

5.4 Scalability 

Unlike both M-tree and mvp-tree, as well as most other known metric or spatial access methods ||9l 
|23l|45], FSIndex does not have a balanced tree. Therefore, the expected average and worst-case search 
time complexity is 0{n + K) - the overhead is proportional to K, the number of inner nodes. Based 
on these considerations, it appears that FSIndex is not scalable for queries of a fixed radius. However, 
the performance can be to a large extent controlled by the choice of alphabet partitions and hence some 
scalability can be achieved by using more partitions for larger datasets in order to reduce the scanning 
time while incurring some additional overhead. 

Figure [8] (b) indicates that FSIndex is scalable with respect to the number of nearest neighbours 
retrieved - the number of residues needed to be scanned grows sublinearly with dataset size (in fact, 
the exponent is 0.25 to 0.3). The exponent for the growth of the number of scanned points (graphs not 
shown here) is about 0.4, indicating that using PATRICIA-like structure improves scalability. The most 
likely reason for sublinear growth of the number of items needed to be scanned is that the search radius 
needed to retrieve a fixed number of nearest neighbours decreases with dataset size, requiring fewer bin 
accesses. 

5.5 Comparison with other indexing schemes 

Results of Subsection I4.7l indicate that FSIndex decisively outperforms all other indexing schemes con- 
sidered. M-tree performed the worst, needing to scan 1.3 million fragments of length 10 in order to 
retrieve the nearest neighbour. The performance of mvp-tree is not much better, taking into account 
the dimensionality: it requires scanning about 1 million fragments of length 9 to retrieve the nearest 
neighbour. Suffix array was generally performing better than mvp-tree, except for retrieving the nearest 
neighbour of length 6. 

In the case of suffix arrays, it is clear that large alphabet and relatively small dataset are responsible 
for relatively poor performance. Also note that suffix trees (and hence suffix arrays) generally are not 
good approximations of the geometry with respect to ^i-type distances - two fragments lacking a com- 
mon prefix may have a small distance. It should be noted that performance of suffix array based scheme 
appears to improve with fragment length compared to FSIndex. 

The poor performance of M-tree and mvp-tree is somewhat surprising because Mao, Xu, Singh and 
Miranker 1*351 have recently proposed using exactly M-tree for fragment similarity searches. However, 
on closer inspection, several differences appear. First, Mao, Xu, Singh and Miranker use a different 
metric. More importantly, they use a significantly improved M-tree creation algorithms. Finally, if their 
results are compared with those from Figure[T2](this can be done at least approximately because the same 
fragment length was used and the size of the yeast proteome dataset used in f35l was very close to the size 
of SwissProt sample used in our experiment), it appears that there is no more than 10-fold improvement. 
While this is quite significant, the total performance appears stiU worse than that of FSIndex. For more 
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detailed comparisons it would be necessary to obtain the code of the improved M-tree from ||35]| and run 
a fuU suite of comparison experiments. 

An interesting possibility would be to try and combine our approach with the Multi Resolution String 
(MRS) index structure for string datasets proposed by Kahveci and Singh [27] in a general context and 
further adapted for protein database searches, cf. e.g. 0. The MRS scheme is based on mapping 
a string dataset to a lower-dimensional space of sequences of integers by means of the first few Haar 
wavelet transform coefficients. In the context of protein sequence databases, the first wavelet coefficient 
corresponds to the well-known concept of the aminoacid content of a fragment, while the second coeffi- 
cient is the difference between the aminoacid content vectors of the first and the second halves of a given 
fragment. The MRS scheme applies to string datasets in any possible alphabet, and does not seem to 
capture the intrinsic geometry of the alphabet itself. For this reason, it appears to us that this approach 
and ours are "complimentary" to each other, and combining them in the context of datasets of protein 
fragments could result in a further performance gain. 

5.6 Range vs kNN queries 

Performance of the branch-and-bound algorithm depends on the order of nodes visited - it is to a great 
advantage if the nodes containing data points closest to the query are visited first so that the bounding 
radius becomes small early on. A frequently used solution |[T2l l23l is to traverse the tree breadth-first, 
keeping the nodes to be visited in a second priority queue, where the priority of a node is given by the 
upper bound of the distance of its covering set from the query. 

The second priority queue is not used for the FSIndex based kNN search. Since the implicit tree is 
heavily unbalanced, the branches with smallest depth are visited first with a similar effect without the 
overhead of the second priority queue. The visiting order of nodes is ensured in the outer loop of the 
CheckNODE function where the index j starts at m — 1, decreasing to / (Algorithm 13.21 ). Since the order 
does not affect the range search performance, the same code can be used for range search. 

While queries based on more than one similarity measure can be used on a single FSIndex, it is to 
be expected that similarity measures different from the one originally used to determine the partitions 
would have worse performance. 

5.7 Applications to general protein sequence search 

General protein sequence search involves retrieval of sequences based on gapped similarity measure. 
Retrieval based on sequential scan with respect to the Smith- Waterman local similarity score 1*491, the 
most commonly used similarity measure, is infeasible for large datasets and a number of heuristic search 
tools were proposed, the most famous being the NCBI BLAST [IJ. 

To perform a database search, BLAST takes the query sequence and divides it into very short frag- 
ments (length 3 is common for proteins). It then searches for close matches of these fragments to the 
fragments of the database sequences. When a match is found, it attempts to extend this 'seed' in order to 
produce a longer, gapped alignment. 
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Recently, there have been a number of attempts to use indexing schemes on DNA or protein frag- 
ments so as to increase the length of seeds and hence accelerate the search Q [171 [35l |5T1. Efficient 
performance of FSIndex also recommends it for such purpose, by taking seeds of length 6-12. By taking 
longer seeds, the total number of seeds can be reduced, leading to less time being spent extending seeds 
that are subsequently rejected. We leave the follow-up on this idea for a subsequent investigation. 

6 Conclusion 

The efforts of the data engineering community in constructing indexing schemes for similarity-based 
information retrieval have been largely concentrated either on generic indexing schemes which would 
work for an arbitrary metric (or normed) space |[T2l |9l IH |45l, or else on q-grams or suffix tree-based 
schemes for string similarity HI |39l |40l |26l |25l. Metric space indexing schemes are often subject to 
the curse of dimensionality (as witnessed in a part of our present investigation), while the suffix tree- 
based schemes are very efficient but may suffer from large use of memory, although we are seeing good 
attempts towards removing this limitation |[26l |25]| 

The FS index scheme proposed in our paper seems to explore the geometry of datasets of short protein 
fragments in a very fortunate way. We believe that the efficiency of the scheme can be further improved, 
but already in its present form it can be put to a number of uses. We intend to use FSindex in the first 
stage of searching the complete protein fragment datasets for functional motifs. We believe that alphabet 
reduction can be used in other indexing methods, in particular in large suffix trees f25l if some space 
reduction is desired at a very little loss in performance. Another promising application is to use FSindex 
for seeds in local similarity search tools. 

The proposed indexing scheme has a theoretical significance. In B4l we have developed a general 
approach to building new indexing schemes from old by applying a number of fundamental operations, 
and explained how a a prototype version of the FSindex scheme can be obtained in this way from the 
so-called projective reduction to a suffix tree. It is conceivable that very efficient indexing schemes can 
be engineered in the future in the same manner by properly combining a number of very simple tools. 

We believe that FSindex is a successful realization of this idea. In particular, we have made public a 
new software tool for assigning biological function to protein fragments, called PFMFind, based on FSin- 
dex as a low-level core component. The source code is available at http : //www . vuw . ac . nz/biodiscovery/Publicat 
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